Introduction

When fitting regression models with multiple explanatory variables, the effect of each explanatory variable is interpreted in conjunction with the other variables in the model. For example, if we wanted to model income then we may consider an individual’s level of education, and perhaps the wealth of their parents. Then, when interpreting the effect an individual’s level of education has on their income, we would also be considering the effect of the wealth of their parents simultaneously, as these two variables are likely to be related.

Modelling with Two Continuous Covariates

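The regression model we will be considering contains the following variables: the outcome variable Balance (credit card balance) and the explanatory variables Limit (credit limit) and Income.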

Exploratory Data Analysis

What is the mean credit Limit?

  • Mean Credit Limit = 4735.6

What is the median credit Balance?

  • Median Credit Balance = 459.5

What percentage of credit card holders have an income greater than $57,470?

  • Proportion with Income > $57,470 = 0.25 (that is, 25%)

What is the correlation coefficient for the linear relationship between Balance and Limit?

  • Cor(Balance, Limit) = 0.8616973

What would be the verbal interpretation of the correlation coefficient for the linear relationship between Balance and Income?

  • Weakly Positive
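The summaries above can be computed with a few base R calls. The following is a minimal sketch rather than the original analysis: the data frame here is a simulated stand-in named credit (the real data are not reproduced), so the printed values will differ from those quoted. It also assumes Income is stored in thousands of dollars, so that $57,470 corresponds to a threshold of 57.47.

```r
# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)   # assumed to be in thousands of dollars
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

# Mean credit limit and median credit balance
mean_limit     <- mean(credit$Limit)
median_balance <- median(credit$Balance)

# Proportion of card holders whose income exceeds the threshold
prop_high_income <- mean(credit$Income > 57.47)

# Correlation coefficients for the two linear relationships
cor_balance_limit  <- cor(credit$Balance, credit$Limit)
cor_balance_income <- cor(credit$Balance, credit$Income)

print(c(mean_limit = mean_limit, median_balance = median_balance,
        prop_high_income = prop_high_income,
        cor_balance_limit = cor_balance_limit,
        cor_balance_income = cor_balance_income))
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why mean(credit$Income > 57.47) returns a proportion rather than a count.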

Collinearity (or multicollinearity) occurs when an explanatory variable within a multiple regression model can be linearly predicted from the other explanatory variables with a high level of accuracy. For example, in this case, since Limit and Income are highly correlated, we could take a good guess as to an individual’s Income based on their Limit. That is, having two or more highly correlated explanatory variables within a multiple regression model essentially provides us with redundant information. Normally, we would remove one of the highly correlated variables, but for the purpose of this example we will ignore the potential issue.
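One way to quantify how well one explanatory variable can be predicted from the other is sketched below. It uses the correlation between Limit and Income and the variance inflation factor, a common collinearity diagnostic that the text above does not itself use. The data frame is a simulated stand-in named credit, so the numbers are illustrative only.

```r
# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

# Correlation between the two explanatory variables
r <- cor(credit$Limit, credit$Income)

# R-squared from predicting one explanatory variable from the other
r2 <- summary(lm(Limit ~ Income, data = credit))$r.squared

# Variance inflation factor: 1 / (1 - R^2); equals 1 when there is no collinearity
vif <- 1 / (1 - r2)

print(c(correlation = r, r_squared = r2, vif = vif))
```

With only two explanatory variables, the R-squared from this auxiliary regression is simply the squared correlation, so both variables share the same variance inflation factor. Larger values indicate that more of the information in one variable is already carried by the other.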

library(ggplot2)    # ggplot(), geom_point(), geom_smooth()
library(gridExtra)  # grid.arrange()

# Balance against credit limit, with a fitted least-squares line
p1 <- ggplot(credit, aes(x = Limit, y = Balance)) +
         geom_point() +
         labs(x = "Credit limit [$]", 
              y = "Credit card balance [$]",
              title = "Relationship between balance and credit limit") +
         geom_smooth(method = "lm", se = FALSE)
# Balance against income, with a fitted least-squares line
p2 <- ggplot(credit, aes(x = Income, y = Balance)) +
         geom_point() +
         labs(x = "Income [$]", 
              y = "Credit card balance [$]",
              title = "Relationship between balance and income") +
         geom_smooth(method = "lm", se = FALSE)

# Display the two plots side by side
grid.arrange(p1, p2, ncol = 2)

Relationship between balance and explanatory variables: credit limit and income.

What is the relationship between balance and credit limit?

  • Positive

What is the relationship between balance and income?

  • Positive

The two scatterplots above focus on the relationship between the outcome variable Balance and each of the explanatory variables separately. To get an idea of the relationship between all three variables, we can use the plot_ly function from the plotly library to plot a 3D scatterplot.
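A minimal sketch of such a call is given below, assuming the plotly package is installed. The data frame is a simulated stand-in named credit, and the choice of which variable goes on which axis, the marker size, and the axis titles are illustrative assumptions rather than the settings used for the figure.

```r
library(plotly)

# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

# 3D scatterplot: explanatory variables on x and y, outcome on z
fig <- plot_ly(credit, x = ~Income, y = ~Limit, z = ~Balance,
               type = "scatter3d", mode = "markers",
               marker = list(size = 3))

# Label the three axes of the 3D scene
fig <- plotly::layout(fig,
                      scene = list(xaxis = list(title = "Income"),
                                   yaxis = list(title = "Credit limit"),
                                   zaxis = list(title = "Credit card balance")))

fig  # renders an interactive plot that can be rotated with the mouse
```

Rotating the plot helps to see that the points lie close to a tilted plane, which is the structure the multiple regression model describes.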

3D scatterplot between balance and explanatory variables: credit limit and income.

Formal Analysis

The multiple regression model we will be fitting to the credit balance data is given as:

\[y_i = \alpha + \beta_1x_{1i} + \beta_2x_{2i} + \epsilon_i, ~~~ \epsilon_i \sim N(0, \sigma^2)\]

where

  • \(y_i\) is the credit balance of the \(i^{th}\) individual;
  • \(\alpha\) is the intercept and positions the best-fitting plane in 3D space;
  • \(\beta_1\) is the coefficient for the first explanatory variable \(x_1\);
  • \(\beta_2\) is the coefficient for the second explanatory variable \(x_2\);
  • \(\epsilon_i\) is the \(i^{th}\) random error component.

Estimates of the parameters from the fitted linear regression model.

term        estimate   std_error   statistic   p_value   lower_ci   upper_ci
intercept   -385.179      19.465     -19.789         0   -423.446   -346.912
Limit          0.264       0.006      44.955         0      0.253      0.276
Income        -7.663       0.385     -19.901         0     -8.420     -6.906
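A table with these columns can be assembled from a fitted lm object. The sketch below uses base R only and a simulated stand-in data frame named credit, so its estimates will be close to, but not the same as, those reported above. The object name balance.model matches the name used later in the text; reg_table is a hypothetical name introduced here.

```r
# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

# Fit the multiple regression model with two continuous explanatory variables
balance.model <- lm(Balance ~ Limit + Income, data = credit)

# Coefficient estimates, standard errors, t statistics and p-values
est <- summary(balance.model)$coefficients
# 95% confidence intervals for each coefficient
ci <- confint(balance.model, level = 0.95)

reg_table <- data.frame(term      = rownames(est),
                        estimate  = est[, "Estimate"],
                        std_error = est[, "Std. Error"],
                        statistic = est[, "t value"],
                        p_value   = est[, "Pr(>|t|)"],
                        lower_ci  = ci[, 1],
                        upper_ci  = ci[, 2],
                        row.names = NULL)
reg_table[-1] <- round(reg_table[-1], 3)  # round every numeric column
print(reg_table)
```

The column names in the table above match those returned by get_regression_table in the moderndive package, so get_regression_table(balance.model) is likely an equivalent shortcut if that package is available. Substituting the reported estimates into the model gives the fitted plane

\[\widehat{y} = -385.179 + 0.264x_{1} - 7.663x_{2},\]

where \(x_1\) is Limit and \(x_2\) is Income.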

Simpson’s Paradox: The scatterplots of credit card balance against credit limit and against income both show positive relationships. Why, then, do we get a negative coefficient for income (\(\widehat{\beta}_{income} = -7.66\))? This is due to a phenomenon known as Simpson’s Paradox, which occurs when trends that appear within different categories (or groups) of data disappear or reverse when the categories are combined.
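The sign change can be reproduced on artificial data. In the sketch below, a simulated stand-in data frame named credit is built so that Income is strongly related to Limit, while Balance increases with Limit and decreases with Income once Limit is held fixed. The simulation settings are assumptions chosen for illustration, not estimates from the real data.

```r
# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

# Income on its own: the slope absorbs the effect of Limit and is positive
marginal <- lm(Balance ~ Income, data = credit)

# Income alongside Limit: the slope is the effect of Income at a fixed Limit
partial <- lm(Balance ~ Limit + Income, data = credit)

income_marginal <- unname(coef(marginal)["Income"])  # positive
income_partial  <- unname(coef(partial)["Income"])   # negative
print(c(marginal = income_marginal, partial = income_partial))
```

Individuals with higher incomes tend to have higher credit limits, and higher limits go with higher balances, so ignoring Limit makes Income look positively related to Balance. Among individuals with similar limits, the relationship runs the other way.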

Assessing Model Fit

Now we need to assess our model assumptions:

  1. The deterministic part of the model captures all the non-random structure in the data (residuals have mean zero).
  2. The scale of the variability of the residuals is constant at all values of the explanatory variables.
  3. The residuals are normally distributed.
  4. The residuals are independent.
  5. The values of the explanatory variables are recorded without error.

First, we need to obtain the fitted values and residuals from our regression model:

library(moderndive)  # provides get_regression_points()
regression.points <- get_regression_points(balance.model)

Recall that get_regression_points provides us with values of the:

  • outcome variable \(y\) (Balance)
  • explanatory variables \(x_1\) (Limit) and \(x_2\) (Income)
  • fitted values \(\widehat{y}\)
  • residuals (\(y - \widehat{y}\))
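The same quantities can be extracted with base R, which makes explicit what each column contains. This is a sketch on a simulated stand-in data frame named credit. The column names ID, Balance_hat and residual follow the naming that get_regression_points is understood to use, which should be treated as an assumption here.

```r
# Simulated stand-in for the credit data (hypothetical values, NOT the real data)
set.seed(1)
n <- 400
Limit   <- rnorm(n, mean = 4700, sd = 2300)
Income  <- 0.012 * Limit + rnorm(n, sd = 21)
Balance <- -385 + 0.264 * Limit - 7.66 * Income + rnorm(n, sd = 165)
credit  <- data.frame(Balance, Limit, Income)

balance.model <- lm(Balance ~ Limit + Income, data = credit)

# One row per individual: observed values, fitted value and residual
regression.points <- data.frame(
  ID          = seq_len(nrow(credit)),
  Balance     = credit$Balance,
  Limit       = credit$Limit,
  Income      = credit$Income,
  Balance_hat = fitted(balance.model),     # fitted values
  residual    = residuals(balance.model)   # observed minus fitted
)

head(regression.points)
```

For a model with an intercept fitted by least squares, the residuals always sum to zero, so a near-zero overall mean does not by itself confirm the first assumption. What matters is whether the residuals stay centred on zero across the range of each explanatory variable.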

We can assess our first two model assumptions by producing scatterplots of our residuals against each of our explanatory variables.

# Residuals against credit limit, with a horizontal reference line at zero
p3 <- ggplot(regression.points, aes(x = Limit, y = residual)) +
         geom_point() +
         labs(x = "Credit limit [$]", 
              y = "Residual",
              title = "Residuals vs. credit limit") +
         geom_hline(yintercept = 0, col = "blue", size = 1)
# Residuals against income, with a horizontal reference line at zero
p4 <- ggplot(regression.points, aes(x = Income, y = residual)) +
         geom_point() +
         labs(x = "Income [$]", 
              y = "Residual",
              title = "Residuals vs. income") +
         geom_hline(yintercept = 0, col = "blue", size = 1)

# Stack the two residual plots vertically
grid.arrange(p3, p4, nrow = 2)

Residual plots of credit limit and income.